1. Various Racial Groups
One of the most important factors that shape private insurance
status is an individual’s racial background. We wanted to understand the
relationship between private insurance coverage and race because,
according to a study conducted by the NIH, “blacks, Hispanics, and some
Asian populations when compared with whites, appear to have lower levels
of health insurance coverage” (National Research Council (US)
Panel on Race, Ethnicity, and Health in Later Life, 2004).
Understanding the relationship between private insurance coverage
and race will help shed light on which racial groups have the lowest
coverage, and this can lead to the development of policies to bridge the
coverage gap. In this dataset, the percentage of each racial group is
encoded as the variables “ACS_PCT_WHITE” representing White,
“ACS_PCT_BLACK” representing Black, “ACS_PCT_ASIAN” representing Asian,
“ACS_PCT_AIAN” representing American Indian and Alaska Native, and
finally “ACS_PCT_MULT_RACE” representing multiple races.
Understanding the distribution of race within each county can help
provide a baseline insight into the population demographics and context
for any interpretation of the analysis. The distribution of racial
groups across the county was determined using the pivot_longer()
function, which put all the races in one column. We named this new
combined variable “sdoh_long.”
Next, we renamed the original race to reflect actual races set by
the Census Bureau: White, Black, Asian, American Indian/Alaska Native,
and Multiracial. This makes it more comprehensible to our audiences.
Then, we created a histogram plot using the ggplot() function. The
x-axis represents the percentage of the county’s population belonging to
a racial group, while the y-axis shows the number of counties. Within
this function, the plotted data’s color, size, and opacity can also be
customized. Finally, x and y-axis labels, as well as a title, were added
using the xlab(), ylab(), and ggtitle() functions.
#visualization of each racial group in percentage for counties
#Convert data to long format
sdoh_long <- sdoh %>%
pivot_longer(cols = c(ACS_PCT_WHITE, ACS_PCT_BLACK, ACS_PCT_ASIAN, ACS_PCT_AIAN, ACS_PCT_MULT_RACE),
names_to = "Race_Ethnicity",
values_to = "Percentage")
#Rename categories for better readability
sdoh_long$Race_Ethnicity <- factor(sdoh_long$Race_Ethnicity,
levels = c("ACS_PCT_WHITE", "ACS_PCT_BLACK", "ACS_PCT_ASIAN", "ACS_PCT_AIAN", "ACS_PCT_MULT_RACE"),
labels = c("White", "Black", "Asian", "American Indian/Alaska Native", "Multiracial"))
#Create a single faceted histogram plot
ggplot(sdoh_long, aes(x = Percentage)) +
geom_histogram(color = "grey", fill = "lightblue", bins = 30) +
facet_wrap(~Race_Ethnicity, nrow = 2) + # Arrange plots in a 2-row layout
xlab("% of County Population") +
ylab("Number of Counties") +
ggtitle("Distribution of Racial/Ethnic Groups by County in the US") +
theme_minimal()

Based on the histogram distribution, it is observed that each race
has a skew to its data. The white population has a left skew, showing
that most of the data falls around 100% of the population for many
counties, which indicates that they are the predominant race in these
counties. This is quite the opposite for all the other racial groups,
which have a right skew showing their percentage is much closer to 0,
which shows they are the minorities and have a lower representation in
these counties, indicating a potential disparity in the dataset
observations that contextualizes the later research findings within a
population predominantly made up of White individuals.
Looking at the summary statistics for each race and the median, it
is seen that the most represented race in the data set is
white-identifying individuals. Having a median of 87.66% in comparison
to the least represented population, American Indian/Alaska Native,
having a median of 0.320%. Looking at the % population distribution of
White individuals, it is seen that they have a mean lower than the
median, which indicates a left skew, which was seen in the histogram
above. In addition, the other races all had a mean higher than the
median, indicating a right skew. We chose to look specifically at the
median because it resists outliers. With this information, we wanted to
plot the distribution of each racial group’s private insurance coverage
based on the county level.
Next, to integrate these findings with our broader interest in
health insurance, we wanted to see the relationship between private
insurance coverage and racial groups in the top 5 counties with the
highest private insurance coverage rates. To achieve this, we used
arrange(desc()) and slice(1:5) to identify these counties. The dataset
was then transformed into a long format using pivot_longer(), allowing
easier visualization. An interactive bar chart was created using
geom_bar() and ggplotly(), displaying the percentage of each racial
group in these top counties.
#Selecting the racial variables, county and private insurance
selected_racial_group_with_county <- sdoh %>%
select(COUNTY, ACS_PCT_WHITE, ACS_PCT_BLACK, ACS_PCT_ASIAN, ACS_PCT_AIAN, ACS_PCT_MULT_RACE, ACS_PCT_PRIVATE_ANY)
#Identify top 5 counties with highest private insurance coverage
top_counties <- selected_racial_group_with_county %>%
arrange(desc(ACS_PCT_PRIVATE_ANY)) %>%
slice(1:5)
top_counties
## # A tibble: 5 × 7
## COUNTY ACS_PCT_WHITE ACS_PCT_BLACK ACS_PCT_ASIAN ACS_PCT_AIAN
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Kalawao County 13.3 0.69 3.44 17.7
## 2 Falls Church city 78.0 4.82 9.88 0.09
## 3 Daggett County 97.6 0.17 0 0.34
## 4 Los Alamos County 84.2 0.81 4.92 0.94
## 5 Burke County 93.4 0.75 1.54 1.07
## # ℹ 2 more variables: ACS_PCT_MULT_RACE <dbl>, ACS_PCT_PRIVATE_ANY <dbl>
#Convert race columns into long format for visualization
selected_long <- top_counties %>%
pivot_longer(cols = c(ACS_PCT_WHITE, ACS_PCT_BLACK, ACS_PCT_ASIAN, ACS_PCT_AIAN, ACS_PCT_MULT_RACE),
names_to = "Race",
values_to = "Percentage")
#Creating an interactive bar plot
p <- ggplot(selected_long, aes(x = Race, y = Percentage, fill = Race,
text = paste("County:", COUNTY,
"<br>Private Insurance:", round(ACS_PCT_PRIVATE_ANY, 2)))) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ COUNTY, scales = "free_y") + # Facet by County only
labs(title = "Race Distribution in Top 5 Counties with Highest Private Insurance",
x = NULL, # Removes x-axis label to avoid repetition
y = "Percentage of Population") +
theme_minimal() +
theme(axis.text.x = element_blank(), # Removes race labels at the bottom
axis.ticks.x = element_blank(),
strip.text = element_text(size = 10, face = "bold"), # Adjust facet title size
legend.position = "bottom") # Moves legend to bottom
#Convert to interactive plotly graph
interactive_plot <- ggplotly(p, tooltip = "text")
interactive_plot
The interactive plot follows the pattern established previously
about the distribution of each race; the most represented are Whites.
However, in Kalawao County, we see that the race with the highest
representation is AIAN, which is an interesting departure from the
observed trend. Also, compared to the other counties, the other races,
like Black, Asian, and Multirace individuals, had a higher
representation in Kalawao County. An interactive plot, as opposed to
other plots, is used because it allows the audience to toggle, zoom in,
and perform many other actions between different elements on the
graph.
Lastly, we wanted to get the actual correlation between each racial
group and private insurance coverage to provide knowledge of the
strength of this relationship. We used the cor() function to calculate
the correlation coefficient for each racial group. The cat() function
assists in printing all four summary outputs simultaneously. The numbers
range from –1 to 1. The closer to one, the higher the correlation of the
two variables. If the value is negative, it indicates that as one
variable increases, the other decreases. However, if you have a positive
value, as one increases, so does the other variable. The racial group
with the highest correlation with private insurance having a moderate
positive correlation was White, having a value of 0.42.
#Print correlation coefficients for every association between race and private insurance at once
{
cat("\nCorrelation: % White vs. Private Insurance\n")
print(cor(sdoh$ACS_PCT_WHITE, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))
cat("\nCorrelation: % Black vs. Private Insurance\n")
print(cor(sdoh$ACS_PCT_BLACK, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))
cat("\nCorrelation: % Asian vs. Private Insurance\n")
print(cor(sdoh$ACS_PCT_ASIAN, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))
cat("\nCorrelation: % American Indian/Alaska Native vs. Private Insurance\n")
print(cor(sdoh$ACS_PCT_AIAN, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))
cat("\nCorrelation: % Multiracial vs. Private Insurance\n")
print(cor(sdoh$ACS_PCT_MULT_RACE, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))
}
##
## Correlation: % White vs. Private Insurance
## [1] 0.4260394
##
## Correlation: % Black vs. Private Insurance
## [1] -0.2949309
##
## Correlation: % Asian vs. Private Insurance
## [1] 0.2303914
##
## Correlation: % American Indian/Alaska Native vs. Private Insurance
## [1] -0.2617146
##
## Correlation: % Multiracial vs. Private Insurance
## [1] -0.1810702
Understanding the correlation between racial groups and private
insurance will help us improve healthcare equity and access by
highlighting whether or not various populations have more or less
coverage than others. We predicted race and private insurance coverage
having a strong correlation, with Whites having a strong positive
correlation and the other races having a strong negative correlation.
However, our findings proved otherwise; we saw a moderate positive
correlation between White and private insurance coverage and Asians
having a low positive correlation (0.23). Blacks and American
Indian/Alaska Natives had a negative low correlation (-0.29 and 0.261,
respectively). These findings might be because certain racial groups are
more likely to have public insurance coverage than others, with White
and Asian populations demonstrating the highest correlation with
increased health insurance coverage. Consequently, it appears that
various populations of color have, unjustly, have less coverage than
others, even in counties with the highest private insurance coverage
rates.
2: Various Education Levels
Now that we know more about different racial groups and their
relationship with private insurance coverage, we wanted to shift our
focus to the education level, as educational attainment is another key
social determinant of health that shapes life outcomes for many
individuals. An interest in investigating this relationship also arose
due to a research study finding that increased educational attainment
was associated with higher rates of health insurance coverage,
particularly in conjunction with the factor of employment (Dewar,
1998). Based on this finding, we predicted that higher educational
attainment would also be associated with higher private health insurance
rates specifically.
In this particular dataset, the percentage of people in each county
with a designated education level is encoded through the following
variables: “ACS_PCT_LT_HS” (less than high school education),
“ACS_PCT_HS_GRADUATE” (only high school diploma), “ACS_PCT_BACHELOR_DGR”
(holds a bachelor’s degree), and “ACS_PCT_GRADUATE_DGR” (holds a
graduate or professional degree).
Similar to our interest in looking at the distribution of the “race”
variable, we aimed to start with a similar investigation of the
“education level” variables to provide insight into what skews might be
present in the data that provide context for later interpretations. To
visualize the distribution of each educational variable, the ggplot()
function used with geom_histogram produces a set of histogram
distribution plots that show the distribution of the population
percentages within each educational level by county in the dataset. The
pivot_longer() function was used to put all the educational levels in
one column. We named this new combined variable “sdoh_long,” used
consistently throughout the project when creating a new combined
variable. Next, we renamed the original variables to educational levels
and created a histogram plot using the ggplot() function in order to
have clearer titles of the variable names.
#Visualize distribution of education level in US by county
#Convert data to long format to put into a facet wrapped plot
sdoh_long <- sdoh %>%
pivot_longer(cols = c(ACS_PCT_LT_HS, ACS_PCT_HS_GRADUATE, ACS_PCT_BACHELOR_DGR, ACS_PCT_GRADUATE_DGR),
names_to = "Education_Level",
values_to = "Percentage")
#Rename education levels for better readability/understanding
sdoh_long$Education_Level <- factor(sdoh_long$Education_Level,
levels = c("ACS_PCT_LT_HS", "ACS_PCT_HS_GRADUATE", "ACS_PCT_BACHELOR_DGR", "ACS_PCT_GRADUATE_DGR"),
labels = c("Less than High School", "High School Graduate", "Bachelor's Degree", "Graduate Degree"))
#Create the faceted histogram plot with all four plots in one figure
ggplot(sdoh_long, aes(x = Percentage)) +
geom_histogram(color = "grey", fill = "light blue", bins = 30) +
facet_wrap(~Education_Level, nrow = 2) +
xlab("% of county population") +
ylab("Number of counties") +
ggtitle("Distribution of Education Levels by County in the US") +
theme_minimal()

In the distribution for graduate and bachelor’s degrees, there is a
right skew to the distribution, which indicates that on the county
level, few people have graduate and bachelor’s degrees compared to the
other levels of education. The same observation can be seen with people
with less than a high school degree. However, the percentage of the
county population of high school graduates appears to have a more normal
distribution. From this, we can reason that the population we are
examining essentially comprises those with educational attainment below
a bachelor’s degree. Consequently, this difference in population size
across education levels may be the reason for any eventual difference
detected between individuals in different educational level groups.
Next, a summary output of basic statistical metrics such as mean and
median percentage for each education level group was calculated to
visualize the distribution numerically, which provides greater context
to our findings from the histogram distributions. The summary()
function, in this case, provides a basic sense of the variables’ numeric
distribution. The cat() function assists in printing all four summary
outputs simultaneously.
#Print all summary statistics for % of population with various education levels at once
{
cat("\n % of Population with Less than High School Diploma\n")
print(summary(sdoh$ACS_PCT_LT_HS))
cat("\n % of Population with Only High School Diploma\n")
print(summary(sdoh$ACS_PCT_HS_GRADUATE))
cat("\n % of Population with a Bachelor's Degree\n")
print(summary(sdoh$ACS_PCT_BACHELOR_DGR))
cat("\n % of Population with a Graduate or Professional Degree\n")
print(summary(sdoh$ACS_PCT_GRADUATE_DGR))
}
##
## % of Population with Less than High School Diploma
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.39 7.99 11.28 12.71 16.30 78.15
##
## % of Population with Only High School Diploma
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.52 29.25 34.20 33.84 39.07 54.96
##
## % of Population with a Bachelor's Degree
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.53 13.74 14.68 17.91 43.00
##
## % of Population with a Graduate or Professional Degree
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 4.98 6.63 7.93 9.54 47.64
This information shows that the mean is above the median for
populations with less than a high school diploma, a bachelor’s degree,
and a graduate or professional degree, indicating that each distribution
is slightly right-skewed. This means counties with high percentages of
these population groups are considered outliers compared to those with
lower percentages. For those with only a high school diploma, the median
is greater than the mean, which indicates a slight left skew and that
counties with lower percentages of these population groups are
considered outliers. Once again, we show that the population primarily
comprises those with only a high school diploma or less, indicating a
lower education level among the SDOH respondents. This may have affected
the data in a way that may not be seen with more respondents with higher
educational attainment.
After visualizing the distribution for each educational level, we
were curious about the correlation between these educational levels and
private health insurance ownership to understand the relationship
between education and health insurance coverage. The following ggplot()
functions used with geom_point create correlation plots for the
relationship between the education level variables and private insurance
ownership. The ggplot() functions are responsible for creating the plot,
with the lab (), lab (), and ggtitle() functions being used to describe
the ax’s names and titles for the plot.
#Plotting all correlation association plots together on the same plot for visualization
# Convert data to long format to put all plots together
sdoh_long <- sdoh %>%
pivot_longer(cols = c(ACS_PCT_LT_HS, ACS_PCT_HS_GRADUATE, ACS_PCT_BACHELOR_DGR, ACS_PCT_GRADUATE_DGR),
names_to = "Education_Level",
values_to = "Education_Percentage")
# Rename education levels for better readability/understanding
sdoh_long$Education_Level <- factor(sdoh_long$Education_Level,
levels = c("ACS_PCT_LT_HS", "ACS_PCT_HS_GRADUATE", "ACS_PCT_BACHELOR_DGR", "ACS_PCT_GRADUATE_DGR"),
labels = c("Less than High School", "High School Graduate", "Bachelor's Degree", "Graduate Degree"))
# Create the faceted scatter plot with trend lines for each relationship
ggplot(sdoh_long, aes(x = Education_Percentage, y = ACS_PCT_PRIVATE_ANY)) +
geom_point(aes(color = Education_Level), alpha = 0.5, size = 2) +
geom_smooth(method = "lm", color = "black", se = FALSE) +
facet_wrap(~Education_Level, nrow = 2) +
scale_color_manual(values = c("green", "blue", "red", "purple")) +
xlab("% of Population with Specified Education Level") +
ylab("% of Population with Private Insurance") +
ggtitle("Association between Education and Private Insurance Ownership") +
theme_minimal() +
theme(legend.position = "none")
## `geom_smooth()` using formula = 'y ~ x'

Based on the visual depictions seen in the correlation plots, higher
percentages of the population comprised of those with less than a high
school education and those with only a high school diploma is associated
with a decreased percentage of the population owning private health
insurance. Higher percentages of the population having a bachelor’s,
graduate, or professional degree are associated with an increased
percentage of the population owning private health insurance. This
indicates that higher education is associated with greater health
insurance coverage. Each plot also contains a trendline to depict the
general relationship between education and private health insurance:
higher education levels are associated with higher private health
insurance ownership.
Finally, the next evaluation was to assess the correlation between
education and private health insurance ownership through a correlation
coefficient value to understand the strength of the relationship beyond
just its direction. Using the cor() and cat() functions together also
prints the correlation coefficients for each of these
relationships.
#Print all correlation coefficients between education level and private insurance rates at once
{
cat("\n % of Population with Less than High School Education\n")
print(cor(sdoh$ACS_PCT_LT_HS, sdoh$ACS_PCT_PRIVATE_ANY))
cat("\n % of Population with Only High School Diploma\n")
print(cor(sdoh$ACS_PCT_HS_GRADUATE, sdoh$ACS_PCT_PRIVATE_ANY))
cat("\n % of Population with a Bachelor's Degree\n")
print(cor(sdoh$ACS_PCT_BACHELOR_DGR, sdoh$ACS_PCT_PRIVATE_ANY))
cat("\n % of Population with a Graduate or Professional Degree\n")
print(cor(sdoh$ACS_PCT_GRADUATE_DGR, sdoh$ACS_PCT_PRIVATE_ANY))
}
##
## % of Population with Less than High School Education
## [1] -0.6949437
##
## % of Population with Only High School Diploma
## [1] -0.2915207
##
## % of Population with a Bachelor's Degree
## [1] 0.5637536
##
## % of Population with a Graduate or Professional Degree
## [1] 0.4332939
As shown in the output, the correlation coefficients for the
variables, including those with less or only a high school diploma, have
negative coefficients (-0.69 and –0.29, respectively). In contrast,
those with a bachelor’s, graduate, or professional degree have positive
coefficients (0.56 and 0.43 respectively). These findings indicate that
our initial prediction of higher education being associated with
increased private health insurance ownership rates is potentially
correct. This suggests that populations with lower educational
attainment may have lower health insurance coverage rates and could
benefit from targeted support to increase healthcare access. Further
investigation is needed to assess further specifics of this relationship
and how to minimize this disparity potentially.
3: Limited English Speaking
Lastly, after looking at two broad variables, we wanted to
investigate a much narrower variable to see if the findings for the
broader categories were consistent. For this, we decided to explore how
English proficiency is related to health insurance coverage. According
to a study on the relationship between English proficiency and
healthcare access, “due to legal and policy contexts governing access to
resources, adults with LEP, who are less likely to be US-born, are
subjected to stricter health insurance coverage eligibility requirements
and exposed to greater complexity and administrative burden in
determining eligibility and acquiring and maintaining health insurance
coverage” (Ramirez
et al, 2023). Based on these previous findings, we predicted a
negative correlation between English proficiency and private health
insurance coverage.
This dataset encodes the percentage of limited English-speaking
households per county as “ACS_PCT_HH_LIMT_ENGLISH.” The summary()
function provides a basic sense of the distribution of this variable,
which allows for contextualization of the findings from this analysis
within the scope of our population.
#Summary statistics for % of limited English-speaking households
summary(sdoh$ACS_PCT_HH_LIMIT_ENGLISH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.290 0.870 3.469 2.160 90.340
The distribution is right-skewed since the mean (3.5%) is above the
median (0.87%). This means that most counties have a relatively low
percentage of limited English-speaking households, and counties with a
very high percentage are considered outliers. As such, our later
findings can be interpreted within this context while leaving space to
understand that in a dataset that has a greater representation of
counties with a high percentage of limited English-speaking households,
results may differ.
To visualize this distribution graphically, which provides the same
goal of contextualizing our research findings, we created a histogram
using the “ggplot2” package in R. We used the function ggplot(), which
takes the dataset and variable of interest as input and plots it along
the x-axis. The function geom_histogram specifies the type of graph we
want to make: a histogram. Within this function, we specified colors for
our graph. We added x and y-axis labels using xlab() and ylab().
Finally, we added a title using ggtitle(). Each function is linked to
our original graph using the “+” operator.
#Distribution of percentage of limited English-speaking households
ggplot(sdoh, aes(ACS_PCT_HH_LIMIT_ENGLISH)) + #selects variable to put on x axis
geom_histogram(color = 'grey', fill = 'light blue')+ #sets graph color
xlab("% of limited English-speaking households") + #adds x axis label
ylab("number of counties")+ #adds y axis label
ggtitle("Distribution of US Counties by Percentage of Limited English-Speaking Households")+ theme_minimal() #sets graph aesthetics built into ggplot package
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This graph depicts the right-skewed distribution of this variable,
as demonstrated by the summary statistics. The percentage of limited
English-speaking households is below 25% for most counties in this
dataset, indicating that English-speaking proficiency is generally low
in the SDOH population.
Next, two methods were used to evaluate the relationship between
limited English-speaking and health insurance coverage: a visual method
(by graph) and a quantitative method (by calculating the correlation
coefficient). For the graph, we used the function geom_point(), which
creates a scatter plot. Within this function, the plotted data’s color,
size, and opacity were also customized. Finally, x and y-axis labels, as
well as a title, were added using the xlab(), ylab(), and ggtitle()
functions. The purpose of this plot is to understand the direction of
the relationship between English proficiency and private health
insurance coverage.
#Correlation between limited English-speaking and private insurance coverage
ggplot(sdoh, aes(x = ACS_PCT_HH_LIMIT_ENGLISH, y = ACS_PCT_PRIVATE_ANY)) +
geom_point(color = 'blue',alpha = 0.5, size = 2) +
geom_smooth(method = "lm", color = "black", se = FALSE) +
xlab("% of limited English-speaking households") +
ylab("% of population with private insurance")+
ggtitle("Association between English-speaking and private insurance ownership")+
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

This graph illustrates a negative correlation between the percentage
of individuals from limited English-speaking households and those with
private insurance coverage by county. In other words, this indicates
that as the number of limited English-speaking households increases, the
percentage of the county with private health insurance decreases
immensely, indicating a potential inequity regarding health insurance
access among populations with limited English profieincy.
The last piece of analysis for evaluating the relationship between
limited English-speaking and insurance coverage was calculating the
correlation coefficient to assess the relationship’s strength. This
provides a quantitative measure of the relationship illustrated on the
scatterplot. R has a built-in function called “cor()” that automatically
calculates the degree of association between two variables.
##Calculate correlation coefficient
cor(sdoh$ACS_PCT_HH_LIMIT_ENGLISH, sdoh$ACS_PCT_PRIVATE_ANY)
## [1] -0.3865378
The correlation coefficient was calculated as –0.39, which indicates
a moderate negative association between limited English-speaking and
insurance coverage. In other words, as the percentage of limited
English-speaking households in a county increases, the percentage of
residents with healthcare insurance coverage decreases.
The association between the percentage of individuals from limited
English-speaking households and those with private insurance coverage
reflected in the graph and by the correlation coefficient is consistent
with our prediction. Further investigation is necessary to understand
the causal relationship between these variables and how this disparity
in health insurance might negatively affect healthcare outcomes for
limited English-speaking populations.
1. Social Context = demographics, socioeconomic status